The purpose is to classify a given silhouette as one of three types of vehicle (bus, car, or van), using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
HOLLOWS RATIO = (area of hollows) / (area of bounding polygon), where area of hollows = area of bounding polygon - area of object. (In the feature definitions, sigma_maj^2 denotes the variance along the major axis and sigma_min^2 the variance along the minor axis.)
NUMBER OF CLASSES: 3 (bus, car, van)
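The hollows-ratio definition above can be expressed directly in code. A minimal sketch (the function name and arguments are illustrative, not part of the notebook):

```python
def hollows_ratio(area_bounding_polygon: float, area_object: float) -> float:
    """hollows ratio = (area of hollows) / (area of bounding polygon),
    where area of hollows = area of bounding polygon - area of object."""
    area_hollows = area_bounding_polygon - area_object
    return area_hollows / area_bounding_polygon

# e.g. a silhouette of area 80 inside a bounding polygon of area 100
print(hollows_ratio(100.0, 80.0))  # -> 0.2
```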
#Import all the necessary modules
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
#!jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
from scipy.stats import zscore
from sklearn.decomposition import PCA
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
df = pd.read_csv("vehicle.csv")
target = 'class'
X = df.loc[:, df.columns!=target]
y = df.loc[:, df.columns==target]
X.head()
X.info()
X.describe().T
def basic_details(df):
    b = pd.DataFrame()
    b['Missing value'] = df.isnull().sum()
    b['N unique value'] = df.nunique()
    b['dtype'] = df.dtypes
    return b
basic_details(X)
X.hist(figsize=(20,20))
# Impute missing values in each numeric column with the column median
for i in list(X._get_numeric_data().columns):
    X[i].fillna(X[i].median(), inplace=True)
basic_details(X)
sns.countplot(x='class',data=y)
y['class'].value_counts()/y.shape[0] * 100
y['class'] = pd.Categorical(y['class']).codes
y['class'].value_counts()/y.shape[0] * 100
There is a slight imbalance in the dataset: the number of observations for the 'car' class is higher than for 'bus' and 'van'. A balanced dataset is preferable for every class to avoid biased predictions. However, we will use the dataset as is for this analysis and not apply any balancing technique.
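Although we keep the data as is, the imbalance could later be compensated with class weights instead of resampling. A minimal sketch using scikit-learn's `compute_class_weight` on a hypothetical label array (the counts below are illustrative, not the actual class distribution):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels mimicking a mild 'car'-heavy imbalance
y_demo = np.array(['car'] * 4 + ['bus'] * 2 + ['van'] * 2)

classes = np.unique(y_demo)  # -> ['bus', 'car', 'van']
# 'balanced' weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
print(dict(zip(classes, weights)))
```

Minority classes receive weights above 1, the majority class below 1, so a weighted classifier penalizes mistakes on rare classes more heavily.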
def correlation_matrix(df):
    corrmat = df.corr()
    top_corr_features = corrmat.index
    plt.figure(figsize=(25,25))
    # plot heat map
    g = sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="RdYlGn")
correlation_matrix(X)
From the heatmap, several columns are highly correlated with other variables and can therefore be ignored while training the model. The following variables will be considered for training the model:
fcolumns = ['compactness', 'circularity', 'distance_circularity', 'radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'elongatedness', 'pr.axis_rectangularity', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1', 'skewness_about.2', 'hollows_ratio']
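The shortlist above was picked by inspecting the heatmap; the same idea can be automated by scanning the upper triangle of the correlation matrix. A hedged sketch (the 0.9 threshold and the demo frame are assumptions, not from the notebook):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns to drop: any column whose absolute correlation with an
    earlier column exceeds `threshold` (the 0.9 cut-off is an assumption)."""
    corr = df.corr().abs()
    # Upper triangle (excluding the diagonal), so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Tiny illustration: `b` is an exact linear function of `a`, so it is flagged
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(drop_highly_correlated(demo))  # -> ['b']
```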
from scipy.cluster import hierarchy as hc
corr = 1 - X.corr()
corr_condensed = hc.distance.squareform(corr) # convert to condensed
z = hc.linkage(corr_condensed, method='complete')
plt.figure(figsize=(20,8))
dendrogram = hc.dendrogram(z,labels=corr.columns,leaf_rotation =90)
sns.pairplot(df,diag_kind='kde')
sns.set(context="paper", font="monospace")
# Create a figure instance
fig = plt.figure(1, figsize=(18, 12))
# Create an axes instance
ax = fig.add_subplot(111)
ax = sns.boxplot(data=X, ax=ax, color="blue")
ax.set_xticklabels(X.columns, rotation=90)
# Add transparency to colors
for patch in ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .3))
Extreme observations that behave completely differently from the rest of the data points are called outliers. Outliers in numeric features can be handled in several ways, depending on domain knowledge. In our case we will replace outliers (points beyond the 1.5*IQR bounds) with the 1st/99th percentile values of the feature; this might not be the best approach.
def outlier(df, columns):
    for i in columns:
        quartile_1, quartile_3 = np.percentile(df[i], [25, 75])
        quartile_f, quartile_l = np.percentile(df[i], [1, 99])
        IQR = quartile_3 - quartile_1
        lower_bound = quartile_1 - (1.5 * IQR)
        upper_bound = quartile_3 + (1.5 * IQR)
        print(i, lower_bound, upper_bound, quartile_f, quartile_l)
        # Use .loc to avoid chained-assignment warnings
        df.loc[df[i] < lower_bound, i] = quartile_f
        df.loc[df[i] > upper_bound, i] = quartile_l
outlier(X,X.columns)
# Re-draw the boxplots after outlier treatment
sns.set(context="paper", font="monospace")
# Create a figure instance
fig = plt.figure(1, figsize=(18, 12))
# Create an axes instance
ax = fig.add_subplot(111)
ax = sns.boxplot(data=X, ax=ax, color="blue")
ax.set_xticklabels(X.columns, rotation=90)
# Add transparency to colors
for patch in ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .3))
From the above visual plots we can see that most of the variables are highly correlated. This correlation between variables introduces redundancy into the information that can be gathered from the data set. Thus, in order to reduce noise (which can add computational complexity in large datasets), we will use PCA to transform the original variables into linear combinations of these variables that are mutually uncorrelated.
Based on the percentage of variation that we want to be captured in transformed data set, we will select the number of Principal Components to be considered.
Whether to standardize the data prior to a PCA on the covariance matrix depends on the measurement scales of the original features. Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if it is measured on different scales. Let us transform the data onto unit scale (mean=0 and variance=1), which is a requirement for the optimal performance of many machine learning algorithms.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
cov_mat1 = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat1)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
eig_vals.sum()
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
plt.plot(var_exp)
# Ploting
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.xticks(np.arange(1, eig_vals.size + 1))
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
The plot above clearly shows that most of the variance (53.21% of the variance to be precise) can be explained by the first principal component; the second principal component bears some information (18.2%), and so on. Together, the first 7 principal components contain 95.96% of the information and the first 8 principal components contain 97.19%. We could choose either 7 or 8 and safely ignore the rest of the principal components.
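Rather than reading the cut-off from the plot, scikit-learn's `PCA` also accepts a float in (0, 1) as `n_components` and keeps just enough components to reach that fraction of explained variance. A small sketch on synthetic data (not the vehicle data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 10)       # synthetic standardized data, not the vehicle set
X_demo[:, 1] = X_demo[:, 0] * 2   # add redundancy so fewer components suffice

# A float target asks PCA to keep the smallest number of components
# whose cumulative explained variance reaches that fraction
pca_demo = PCA(n_components=0.95)
reduced = pca_demo.fit_transform(X_demo)
print(pca_demo.n_components_, pca_demo.explained_variance_ratio_.sum())
```

Because one column is an exact copy of another, the data has rank 9 and fewer than 10 components are enough to hit the 95% target.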
# Using scikit learn PCA here. It does all the above steps and maps data to PCA dimensions in one shot
from sklearn.decomposition import PCA
# NOTE - we are generating only 10 PCA dimensions (dimensionality reduction from 17 to 10)
pca = PCA(n_components=10)
data_reduced = pca.fit_transform(X_std)
data_reduced.transpose()
df_comp = pd.DataFrame(pca.components_,columns=list(X))
df_comp.head()
plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma',)
#Test train split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data_reduced,y.values.ravel(),test_size=0.2,random_state=3)
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, cross_val_predict,cross_validate
from sklearn import metrics
nb_clf = GaussianNB()
scoring = {'acc': 'accuracy',
'prec_macro': 'precision_macro',
'rec_macro': 'recall_macro'}
scores=cross_validate(nb_clf, X_train,y_train, cv=5,scoring=scoring,return_train_score=True)
print(scores.keys())
print("Cross-validated scores for train accuracy:", scores['train_acc'].mean())
print("Cross-validated scores for train precision:", scores['train_prec_macro'].mean())
print("Cross-validated scores for train recall:", scores['train_rec_macro'].mean())
print("\nCross-validated scores for test accuracy:", scores['test_acc'].mean())
print("Cross-validated scores for test precision:", scores['test_prec_macro'].mean())
print("Cross-validated scores for test recall:", scores['test_rec_macro'].mean())
# Make cross validated predictions
predictions = cross_val_predict(nb_clf, X_train,y_train, cv=5)
# Train the model (a.k.a. `fit` training data to it).
nb_clf.fit(X_train,y_train)
# Use the model to make predictions based on testing data.
y_pred_nb = nb_clf.predict(X_test)
#Compute confusion matrix
from sklearn.metrics import confusion_matrix
cm_nb = confusion_matrix(y_test,y_pred_nb)
cm_nb
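A confusion matrix like the one above can be summarized into overall accuracy and per-class recall directly from its entries. A sketch on a hypothetical 3x3 matrix (the numbers are illustrative, not the notebook's actual results):

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, cols = predicted class)
cm_demo = np.array([[30,  2,  3],
                    [ 1, 60,  4],
                    [ 2,  5, 40]])

accuracy = np.trace(cm_demo) / cm_demo.sum()               # correct / total
per_class_recall = np.diag(cm_demo) / cm_demo.sum(axis=1)  # diagonal / row totals
print(accuracy, per_class_recall)
```

Reading the rows shows where a classifier confuses classes, which raw accuracy alone hides.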
from sklearn.svm import SVC
classifier_svm_kernel = SVC(C=1.0,kernel='rbf')
scoring = {'acc': 'accuracy',
'prec_macro': 'precision_macro',
'rec_macro': 'recall_macro'}
# Perform 5-fold cross validation
svm_scores = cross_validate(classifier_svm_kernel, X_train,y_train, cv=5,scoring=scoring,return_train_score=True)
print(svm_scores.keys())
print("Cross-validated scores for train accuracy:", svm_scores['train_acc'].mean())
print("Cross-validated scores for train precision:", svm_scores['train_prec_macro'].mean())
print("Cross-validated scores for train recall:", svm_scores['train_rec_macro'].mean())
print("\nCross-validated scores for test accuracy:", svm_scores['test_acc'].mean())
print("Cross-validated scores for test precision:", svm_scores['test_prec_macro'].mean())
print("Cross-validated scores for test recall:", svm_scores['test_rec_macro'].mean())
# Make cross validated predictions
predictions = cross_val_predict(classifier_svm_kernel, X_train,y_train, cv=5)
# Train the model (a.k.a. `fit` training data to it).
classifier_svm_kernel.fit(X_train,y_train)
# Use the model to make predictions based on testing data.
y_pred_svm = classifier_svm_kernel.predict(X_test)
#Compute confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred_svm)
cm
#Comparing the predictions with the actual results
comparison = pd.DataFrame(y_test,columns=['y_test'])
comparison['y_predicted'] = y_pred_svm
comparison.head()
#Applying grid search for optimal parameters and model after k-fold validation
from sklearn.model_selection import GridSearchCV
parameters = [{'C':[0.01,0.05,0.5, 0.1,5.3], 'kernel':['rbf','linear'], 'gamma': [0.01, 0.05,0.1,0.125,0.15, 0.17, 0.5,1]}]
grid_search = GridSearchCV(estimator=classifier_svm_kernel, param_grid=parameters, scoring ='accuracy',cv=5,n_jobs=-1)
grid_search = grid_search.fit(X_train,y_train)
best_accuracy = grid_search.best_score_
best_accuracy
opt_param = grid_search.best_params_
opt_param
y_pred = grid_search.predict(X_test)
#Compute confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test,y_pred)
cm
print(classification_report(y_test, y_pred, target_names = ['bus', 'car', 'van']))
Based on the percentage of variation explained by each principal component, we chose to consider the first 10 components, as they explain close to 98.63% of the variability. That is a dimensionality reduction from 17 to 10, ignoring the rest of the principal components.
Here we can see that we are not losing much information by transforming the data to a new feature space, and we are able to capture most of the variance with these new principal components.
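One way to sanity-check how little information the reduction discards is to project back with `inverse_transform` and measure the reconstruction error. A sketch on synthetic data with a deliberately redundant column (not the vehicle data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X_demo = rng.randn(100, 6)                   # synthetic standardized data
X_demo[:, 5] = X_demo[:, 0] + X_demo[:, 1]   # redundant column -> rank 5

pca_full = PCA()                # keep all components: lossless reconstruction
pca_part = PCA(n_components=5)  # drop the weakest component

recon_full = pca_full.inverse_transform(pca_full.fit_transform(X_demo))
recon_part = pca_part.inverse_transform(pca_part.fit_transform(X_demo))

err_full = np.mean((X_demo - recon_full) ** 2)
err_part = np.mean((X_demo - recon_part) ** 2)
# Both errors are near zero: the dropped component carried no information
# because column 5 is an exact combination of columns 0 and 1
print(err_full, err_part)
```

This mirrors the situation in the vehicle data: the discarded components mostly encode redundancy among correlated silhouette features.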